- A buzzword I don't really like, but sadly applies to my work.
- MLOps: Overview, Definition & Architecture
- "Examine how ML processes can be automated & operationalized"
- Methodology
- Literature survey
- Interviews
- Principles = Best Practices
- CI/CD automation – fast feedback for build, test, delivery & deploy
- Workflow orchestration
- Reproducibility – same inputs, code & config yield the same results
- Versioning – data, model, code for reproduction and tracing
- Collaboration – on data, model and code
- Continuous ML training & evaluation
- monitoring
- feedback loop
- automated ML workflow pipeline
- + eval run to check for changes in model quality (gate sketch after this list)
- ML Metadata tracking/logging – full traceability
- Continuous Monitoring – periodic assessment of data, model, code, infra, model perf
- Feedback loops – eval -> engineering, monitoring -> scheduler, etc.
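- Rough sketch of the retrain → evaluate → gate loop plus metadata logging, to make the continuous-training and metadata-tracking principles concrete. Everything here (function names, the toy model, the log schema) is my own placeholder, not from the paper:

```python
import hashlib
import json
import time

# Placeholder training stack – all names are my own, not the paper's.
def train_model(training_data):
    """'Train' a candidate model (toy: just store the mean)."""
    return {"mean": sum(training_data) / len(training_data)}

def evaluate(model, holdout):
    """Score on held-out data (toy: negative mean absolute error)."""
    return -sum(abs(x - model["mean"]) for x in holdout) / len(holdout)

def continuous_training_step(training_data, holdout, current_score, metadata_log):
    """One retrain -> eval -> gate iteration of the automated pipeline."""
    candidate = train_model(training_data)
    score = evaluate(candidate, holdout)

    # ML metadata tracking: data hash + score + time, for full traceability.
    metadata_log.append({
        "data_hash": hashlib.sha256(json.dumps(training_data).encode()).hexdigest(),
        "score": score,
        "timestamp": time.time(),
    })

    # Eval gate: promote the candidate only if quality didn't degrade.
    if score >= current_score:
        return candidate, score
    return None, current_score

log = []
model, best = continuous_training_step([1.0, 2.0, 3.0], [2.0, 2.5], float("-inf"), log)
print(model, round(best, 3), log[-1]["data_hash"][:8])
```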
- Components
- CI/CD automation
- Source code repository – stores & versions code
- Workflow Orchestration – DAGs (toy DAG sketch after this list)
- Feature Store – offline & online stores (sketch after this list)
- Model Training Infrastructure
- Model registry – trained models + metadata
- ML Metadata store
- Model serving component
- Monitoring component – includes TensorBoard
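- Toy version of the orchestration idea: declare step dependencies as a DAG, then execute in topological order. Real orchestrators (Airflow etc.) add scheduling, retries and parallelism on top of exactly this ordering. The step names are made up:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Made-up pipeline steps standing in for real orchestrator tasks.
def ingest():    print("ingest raw data")
def featurize(): print("compute features")
def train():     print("train model")
def evaluate():  print("evaluate model")
def deploy():    print("deploy if eval passed")

# The DAG: task -> set of tasks it depends on.
dag = {
    "featurize": {"ingest"},
    "train":     {"featurize"},
    "evaluate":  {"train"},
    "deploy":    {"evaluate"},
}
tasks = {f.__name__: f for f in (ingest, featurize, train, evaluate, deploy)}

# Execute in dependency order.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()
```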
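- And the offline/online split of the feature store in ~20 lines – write once, serve twice. The schema and names are mine, for illustration only:

```python
# Minimal feature store sketch: offline log for training, online view for serving.
class FeatureStore:
    def __init__(self):
        self.offline = []   # full history, source for batch training sets
        self.online = {}    # latest value per entity, for low-latency serving

    def ingest(self, entity_id, features):
        """Write once, serve twice: append to offline log, upsert online view."""
        self.offline.append((entity_id, features))
        self.online[entity_id] = features

    def training_frame(self):
        """Offline store: historical rows for batch training."""
        return list(self.offline)

    def serve(self, entity_id):
        """Online store: lookup of the freshest features at inference time."""
        return self.online.get(entity_id)

store = FeatureStore()
store.ingest("user_42", {"clicks_7d": 10})
store.ingest("user_42", {"clicks_7d": 12})
print(store.serve("user_42"))       # {'clicks_7d': 12} – latest only
print(len(store.training_frame()))  # 2 – full history
```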
- People – role boundaries aren't clean in practice
- Business stakeholder
- Solution "architect"
- Data scientist / ML Engineer
- Data Engineer (Feature engineer)
- Software Engineer
- DevOps
- ML Engineer / ML Ops engineer
- <Standard lifecycle diagram>
- Can have the monitoring system forward drift alerts to the primary system (sketch below)
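- Minimal sketch of that feedback loop: a crude mean-shift drift score (a real monitor would use a KS test or PSI instead) that forwards an alert to a retraining hook. The hook and threshold are hypothetical:

```python
import statistics

def drift_score(reference, live):
    """Crude drift proxy: shift of the live mean, in reference std-devs."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference) or 1.0  # guard against zero spread
    return abs(statistics.mean(live) - ref_mean) / ref_std

def monitor(reference, live, trigger_retraining, threshold=2.0):
    """Feedback loop: monitoring forwards a drift alert to the primary system."""
    score = drift_score(reference, live)
    if score > threshold:
        trigger_retraining(score)  # hypothetical hook, e.g. into the scheduler

monitor(
    reference=[1.0, 1.1, 0.9, 1.05, 0.95],
    live=[2.0, 2.1, 1.9, 2.05],
    trigger_retraining=lambda s: print(f"drift score {s:.1f} -> retrain"),
)
```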
- Intersection of ML, SWE, DevOps, Data Engineering
- Challenges: organizational, ML-system, and operational headaches
- Conclusion:
- In the real world, we observe data scientists still managing ML workflows manually to a great extent. The paradigm of Machine Learning Operations (MLOps) addresses these challenges.
- Follow ups
- Contrast this paper with existing solutions and systems
- Point S., E. to this paper for interview questions